Abstract

This paper describes two applications of conditional 
restricted Boltzmann machines (CRBMs) to the task 
of autotagging music. The first consists of training a 
CRBM to predict tags that a user would apply to a 
clip of a song based on tags already applied by other 
users. By learning the relationships between tags, this 
model is able to pre-process training data to signifi-
cantly improve the performance of a support vector
machine (SVM) autotagger. The second is the use of
a discriminative RBM, a type of CRBM, to autotag 
music. By simultaneously exploiting the relationships 
among tags and between tags and audio-based fea- 
tures, this model is able to significantly outperform 
SVMs, logistic regression, and multi-layer perceptrons. 
In order to be applied to this problem, the discrimina- 
tive RBM was generalized to the multi-label setting 
and four different learning algorithms for it were eval- 
uated, the first such in-depth analysis of which we are 
aware. 

1 Introduction 

With the sizes of online music and media databases 
growing to millions and billions of items, users need 
tools for searching and browsing these items in in- 
tuitive ways. One approach that has proven to be 
popular is the use of social tags [1], short descrip- 
tions applied by users to items. Users can search and 
browse through a collection using the tags that they 
or others have applied. This system works well for 
popular items that have been tagged by many users,
but fails for items that are new or niche; this is the
so-called cold start problem [2].

One promising way to overcome the cold start is 
through content-based analysis and tagging of the 
items in the collection, known as autotagging. Re- 
searchers have investigated a number of autotagging 
techniques for music over the last decade [3, 4, 5]. 
While a few autotagging techniques attempt to cap- 
ture the relationship between tags (e.g. [6]), many 
treat each tag as a separate classification or ranking 
problem (e.g. [7]). The problem of predicting the 
presence or relevance of multiple tags simultaneously 
is known as the multi-label classification problem [8]. 

This paper explores techniques for autotagging mu- 
sic that incorporate the relationships between tags. 
We approach this problem in two ways, both of which 
are based on conditional restricted Boltzmann ma- 
chines (RBMs) described in Section 2. The first ap- 
proach, described in Section 2.1, is a novel model 
trained to predict the tags that a user will apply to 
music based on the tags other users have applied to it. 
It is a purely textual model in that it does not utilize
the audio at all to make predictions. These predicted 
tags, which we call "smoothed" tags, are then used 
to train different types of classifiers that do utilize 
audio. 

The second approach, described in Section 3, is a 
discriminative RBM [9], which learns to jointly predict 
tags from features extracted from the audio. We ex- 
tend the discriminative RBM to perform multi-label 
classification instead of the winner-take-all classifi- 
cation performed by previous discriminative RBMs. 
This new model requires a new training algorithm. 
We explore four techniques for approximating the gra-
dient of the model parameters, namely maximum
likelihood using contrastive divergence, maximum
pseudo-likelihood, mean-field contrastive divergence, 
and loopy belief propagation approximations. 

Section 4 investigates the performance of these two 
methods separately and together on three different 
datasets, two of which have been previously described 
in the literature, and one of which (the largest of the 
three) is new and has not been used to train or test 
autotaggers before.

2 Restricted Boltzmann machines

This section describes the restricted Boltzmann ma-
chine (RBM) [10], its conditional variant the condi-
tional RBM, and one particular type of conditional
RBM, the discriminative RBM. The RBM is an undi-
rected graphical model that generatively models a
set of input variables $y = (y_1, \ldots, y_C)^\top$ with a set
of hidden variables $h = (h_1, \ldots, h_n)^\top$. Both y and
h are typically binary, although other distributions 
are possible. The model is "restricted" in that the 
dependency between the hidden and visible variables 
is bipartite, meaning that the hidden variables are 
independent when conditioned on the visible variables 
and vice versa. The joint probability density function 
is 

$$p(y, h) = \frac{1}{Z} e^{-\mathcal{E}(y, h)} \tag{1}$$

where

$$\mathcal{E}(y, h) = -h^\top U y - c^\top h - d^\top y, \tag{2}$$

$$Z = \sum_{y, h} e^{-\mathcal{E}(y, h)}, \tag{3}$$

U is a matrix of real numbers, and c and d are vectors
of real numbers. The computation of Z, known as the
partition function, is intractable due to the number
of terms being exponential in the number of units.
The marginal of y, however, is $p(y) = e^{-\mathcal{F}(y)} / Z$,
where $\mathcal{F}(y)$ is the free energy of y and can be easily
computed as

$$\mathcal{F}(y) = -\log \sum_{h} e^{-\mathcal{E}(y, h)}. \tag{4}$$

The parameters of the model can be optimized using
gradient descent to minimize the negative log likeli-
hood of data $\{y_t\}$ under this model

$$\frac{\partial}{\partial \theta} \left( -\log p(y_t) \right) = \mathbb{E}_{h \mid y_t}\left[ \frac{\partial}{\partial \theta} \mathcal{E}(y_t, h) \right] - \mathbb{E}_{y, h}\left[ \frac{\partial}{\partial \theta} \mathcal{E}(y, h) \right]. \tag{5}$$

The first expectation in this expression is easy to
compute, but the second is intractable and must be
approximated. One popular approximation for it is
contrastive divergence [11], which uses a small number
of Gibbs sampling steps starting from the observed
example to sample from p(y, h).

RBMs can be conditioned on other variables [12].
In general, as shown in Figure 1(b), both the hid-
den and visible units can be conditioned on other
variables, $u = (u_1, \ldots, u_D)^\top$ and $a = (a_1, \ldots, a_A)^\top$,
respectively. Including these interactions, the energy
function becomes

$$\mathcal{E}(y, h, u, a) = -h^\top U y - h^\top W u - y^\top V a - d^\top y - c^\top h \tag{6}$$

and $p(y, h \mid u, a) \propto e^{-\mathcal{E}(y, h, u, a)}$, where W and V are
real matrices. The vectors Va and Wu act like addi-
tional biases on y and h. By setting the appropriate
W or V matrix or conditioning vector u or a to 0,
the conditioning can apply to only the visible units,
as in Figure 1(a), or only the hidden units, as in
Figure 1(c). For an observed data point $y_t, u_t, a_t$,
the gradient of the log likelihood with respect to a
parameter $\theta$ becomes

$$\frac{\partial}{\partial \theta} \log p(y_t \mid u_t, a_t) = -\mathbb{E}_{h \mid y_t, u_t, a_t}\left[ \frac{\partial}{\partial \theta} \mathcal{E}(y_t, h, u_t, a_t) \right] + \mathbb{E}_{y, h \mid u_t, a_t}\left[ \frac{\partial}{\partial \theta} \mathcal{E}(y, h, u_t, a_t) \right]. \tag{7}$$

[13] describes a conditional RBM used for collabora-
tive filtering in which only the hidden variables are
conditioned on other variables.
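As an illustration of why the free energy is cheap to evaluate, the following minimal sketch (not the authors' released code) compares a brute-force sum over all hidden configurations with the standard closed form for binary hidden units, F(y) = -d^T y - sum_i log(1 + exp(c_i + U_i y)), where U_i is the i-th row of U. The toy parameter values and all function names are assumptions made for this example.

```python
import itertools
import math

def energy(y, h, U, c, d):
    """E(y, h) = -h^T U y - c^T h - d^T y for binary lists y and h."""
    hUy = sum(h[i] * U[i][j] * y[j] for i in range(len(h)) for j in range(len(y)))
    return -hUy - sum(ci * hi for ci, hi in zip(c, h)) - sum(dj * yj for dj, yj in zip(d, y))

def free_energy_bruteforce(y, U, c, d):
    """F(y) = -log sum_h exp(-E(y, h)), enumerating all 2^n hidden states."""
    total = sum(math.exp(-energy(y, h, U, c, d))
                for h in itertools.product([0, 1], repeat=len(c)))
    return -math.log(total)

def free_energy(y, U, c, d):
    """Closed form for binary hidden units; cost is linear in the number of hidden units."""
    act = [c[i] + sum(U[i][j] * y[j] for j in range(len(y))) for i in range(len(c))]
    return -sum(dj * yj for dj, yj in zip(d, y)) - sum(math.log1p(math.exp(a)) for a in act)

# Toy parameters (hypothetical): 2 hidden units, 3 visible units.
U = [[0.5, -1.0, 0.2], [1.5, 0.3, -0.7]]
c = [0.1, -0.4]
d = [0.0, 0.2, -0.3]
y = [1, 0, 1]
```

The two functions agree on this toy model; only the closed form remains practical once the number of hidden units grows.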
Figure 1: Schematic diagrams of the various restricted Boltzmann machines under investigation. (a) RBM
for tag smoothing conditioned on just auxiliary information: user, track, clip identity. (b) RBM for tag
smoothing conditioned on auxiliary information and tags of other users. (c) Discriminative RBM for audio
classification. Filled circles show variables that are always observed, open circles show variables that are
inferred at test time.

2.1 Conditional RBMs for tag smoothing



and audio clips of tracks. This model is purely textual, 
meaning that it only operates on the tags and not on 
the audio. 

All of the datasets used in this paper were collected 
by open-endedly soliciting tags from users to describe
audio clips. This means that the tags that they con- 
tain are most likely relevant, but the tags that are 
not present are not necessarily irrelevant. Thus there 
is a need to distinguish tags omitted but still relevant 
from those that do not apply, as well as tags that 
were included erroneously from those that truly apply. 
As shown in [14], the co-occurrences of tags can be 
used to predict both of these cases. For example, if 
the tags rap and hip-hop frequently co-occur and a
clip has been tagged hip-hop but not rap, it would
be reasonable to increase the likelihood of rap being
relevant to that clip, although perhaps not as much
as if it had actually been applied by a user. Similarly,
it might be reasonable to decrease the likelihood of
hip-hop being relevant as it was not corroborated by
an application of rap. 

We use the doubly conditional RBM shown in Fig- 
ure 1(b) for this sort of tag "smoothing" as we call it. 
The binary visible units represent the tags that one 
user has applied to a clip and the hidden units capture 
second order relationships between these tags. The 
visible units are conditioned on auxiliary variables 
a which represent as one-hot vectors the user, track,
and clip from which a vector of tags is observed. The
hidden units are conditioned on auxiliary variables u,
which represent the tags that other users have applied 



to the same clip. 

The vectors y and u are the same size, but whereas 
y is a binary vector representing which of the fixed 
vocabulary of tags the target user applied to the target 
clip, u is a vector of the average of these binary vectors 
for all of the other users who have seen the target clip. 
Thus the values in u are still between 0 and 1, but
are continuous-valued. At test time, u is set to the
average tag vector of all of the users and predicts the 
tags that a new user would likely apply to the clip 
given the tags that other users have already applied. 
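The construction of y and u can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `tag_vectors` and the toy triples are hypothetical, and it assumes that the users who have "seen" a clip are exactly those who appear in at least one (user, clip, tag) triple for it.

```python
def tag_vectors(triples, vocab, target_user, clip):
    """Return (y, u) for one (user, clip) pair.

    y: binary tag vector applied by target_user to clip.
    u: average binary tag vector of all other users who tagged clip.
    """
    index = {tag: j for j, tag in enumerate(vocab)}
    per_user = {}
    for user, c, tag in triples:
        if c == clip and tag in index:
            per_user.setdefault(user, [0] * len(vocab))[index[tag]] = 1
    y = per_user.get(target_user, [0] * len(vocab))
    others = [v for user, v in per_user.items() if user != target_user]
    if not others:
        return y, [0.0] * len(vocab)
    u = [sum(col) / len(others) for col in zip(*others)]
    return y, u

triples = [("u1", "c1", "rap"), ("u2", "c1", "hip-hop"),
           ("u3", "c1", "hip-hop"), ("u3", "c1", "rap"), ("u2", "c2", "rap")]
y, u = tag_vectors(triples, ["hip-hop", "rap"], "u1", "c1")
```

At test time the same averaging would run over all users of the clip, with no user held out.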

The weights V and W are penalized with an $\ell_1$
cost to encourage them to only capture dependencies 
that depend on specific settings of the auxiliary vari- 
ables and push into U the dependencies that exist 
independently of the auxiliary variables. This means 
that V should ideally only capture tag information 
relevant to a particular user, clip, or track, W should 
capture information about the relationships between 
other users' tags and the current user's tags, and U
should capture information about the co-occurrences 
of tags in general. 

Compare this to the singly conditional RBM shown 
in Figure 1(a) and described in [14]. This CRBM 
also includes the conditioning of the visible units on 
the user, clip, and track information, but does not in- 
clude the conditioning on other users' tags. While the 
doubly conditional RBM can use its modeling power 
to learn to predict a specific user's tags from general
tag patterns, this singly conditional model must pre-
dict both the general tag patterns and a specific user's
tags, a harder problem. We found that smoothing with
the doubly conditional RBM trained better SVMs in
a validation experiment, and so we did not include
the singly conditional RBM in our experiments.

3 Discriminative RBMs 

One useful variant of the conditional RBM is the 
discriminative RBM [9], shown in Figure 1(c). The 
discriminative RBM is a conditional RBM that is 
trained to predict the probability of the class labels, 
y, from the rest of the inputs, x. Based on the energy 
function of (6), it corresponds to setting u = x and 
a = 0. 

For a set of observed data points $\{(y_t, x_t)\}$, the
discriminative RBM optimizes the log conditional,
$\log p(y_t \mid x_t)$, i.e. it focuses on predicting $y_t$ from $x_t$
well. A generative variant of this RBM would in-
stead optimize $\log p(y_t, x_t)$, the joint distribution (in
this case, $x_t$ acts as an extension of $y_t$, i.e. the model is not
conditioned on it).

Looking at the parameter gradient of (7), we see 
that the second expectation requires a sum over all 
configurations of y. When y can take only a few 
values, as in ordinary classification tasks [9], this ex- 
pectation can be computed efficiently and exactly. 
However, here y is a set of C binary indicators (the 
presence of a tag) that are not mutually exclusive, so 
that the expectation has $2^C$ terms and must be ap-
proximated because it cannot be computed in closed
form. Note that given a value for y, $p(h \mid y, x_t)$ factors
and is computed exactly.
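To make the 2^C cost concrete, the sketch below computes the exact tag marginals of a tiny discriminative RBM by enumerating every tag vector. It assumes binary hidden units and the energy in (6) with u = x and a = 0; the function names are hypothetical. Such a brute-force reference is only feasible for a handful of tags, but it can be useful for checking an approximation on toy problems.

```python
import itertools
import math

def free_energy_given_x(y, x, U, W, c, d):
    """F(y | x) = -d^T y - sum_i log(1 + exp(c_i + U_i y + W_i x))."""
    total = -sum(dj * yj for dj, yj in zip(d, y))
    for i in range(len(c)):
        act = (c[i]
               + sum(U[i][j] * y[j] for j in range(len(y)))
               + sum(W[i][k] * x[k] for k in range(len(x))))
        total -= math.log1p(math.exp(act))
    return total

def exact_tag_marginals(x, U, W, c, d):
    """p(y_j = 1 | x) by summing over all 2^C tag vectors (small C only)."""
    C = len(d)
    states = list(itertools.product([0, 1], repeat=C))
    weights = [math.exp(-free_energy_given_x(y, x, U, W, c, d)) for y in states]
    Z = sum(weights)
    return [sum(w for y, w in zip(states, weights) if y[j] == 1) / Z
            for j in range(C)]
```

With U set to zero the tags decouple and each marginal reduces to sigm(d_j), which gives a simple sanity check.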

3.1 Approximations to the expectation

In the case of the discriminative RBM involving y,
h, u = x, and a = 0, we approximate the $\mathbb{E}_{y, h \mid x_t}$
term in (7) in three different ways: using contrastive
divergence, mean-field contrastive divergence, and
loopy belief propagation. We also compare a simi-
lar, but tractable computation that maximizes the
pseudo-likelihood. The difficulty in computing this
expectation stems directly from the difficulty in com-
puting $p(y, h \mid x_t)$, which in turn is caused by the
interdependence of the y and h variables.

Contrastive divergence (CD) [11] has proven to 
be a very popular algorithm for estimating the log- 
likelihood gradient in RBMs, and it can also be used 
in the case of conditional RBMs. Typically, it is 
used to compute $\mathbb{E}_{y, x, h}[\cdot]$ as opposed to here, where
we compute $\mathbb{E}_{y, h \mid x_t}[\cdot]$. To compute the usual CD-$k$
update, $k$ steps of block Gibbs sampling, starting from
the observed example $(x_t, y_t)$, are used to approximate
the expectation. The block Gibbs chain is obtained
by alternating sampling from $p(h \mid y, x)$ and sampling
from $p(y, x \mid h)$. In the case of the conditional CD, we
sample from $p(h \mid y, x_t)$ and then from $p(y \mid h)$ (since
h isolates y from $x_t$), keeping $x_t$ fixed throughout.
CD can be noisy because it uses a small number 
of samples (usually only one), and it can be biased 
because it doesn't necessarily run the Markov chain 
to convergence (usually only 1 to 10 steps). 
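The conditional Gibbs chain described above can be sketched as follows. This is an illustrative implementation under the energy in (6) with u = x and a = 0, not the authors' code; the function names and the use of a seeded `random.Random` are assumptions for the example.

```python
import math
import random

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample_bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def p_h_given_yx(y, x, U, W, c):
    """p(h_i = 1 | y, x) = sigm(c_i + U_i y + W_i x)."""
    return [sigm(c[i]
                 + sum(U[i][j] * y[j] for j in range(len(y)))
                 + sum(W[i][k] * x[k] for k in range(len(x))))
            for i in range(len(c))]

def p_y_given_h(h, U, d):
    """p(y_j = 1 | h) = sigm(d_j + sum_i U_ij h_i); x does not appear."""
    return [sigm(d[j] + sum(U[i][j] * h[i] for i in range(len(h))))
            for j in range(len(d))]

def cd_k_negative_sample(y_t, x_t, U, W, c, d, k=1, rng=None):
    """Run k block Gibbs steps from the observed labels with x_t clamped.

    Returns the sampled labels and the hidden probabilities given them,
    i.e. the statistics used for the negative phase of the update.
    """
    rng = rng or random.Random(0)
    y = list(y_t)
    for _ in range(k):
        h = sample_bernoulli(p_h_given_yx(y, x_t, U, W, c), rng)
        y = sample_bernoulli(p_y_given_h(h, U, d), rng)
    return y, p_h_given_yx(y, x_t, U, W, c)
```

The positive phase uses the observed y_t directly with `p_h_given_yx(y_t, x_t, ...)`, so only the negative phase requires sampling.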

The mean-field contrastive divergence approach ap- 
proximates the y and h variables using their condi- 
tional expectations (given each other) and iteratively 
updates each one based on the estimate of the other 
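until convergence (note that $x_t$ is fixed):

$$\hat{h}^{k} = \mathbb{E}[h \mid \hat{y}^{k-1}, x_t] = \mathrm{sigm}(c + U \hat{y}^{k-1} + W x_t) \tag{8}$$

$$\hat{y}^{k} = \mathbb{E}[y \mid \hat{h}^{k}, x_t] = \mathrm{sigm}(d + U^\top \hat{h}^{k}). \tag{9}$$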

In this case, we plug the continuous-valued expec-
tations into these equations instead of the sampled
binary values that should formally be used. While
this method is straightforward, it cannot capture mul-
timodal distributions in y and h, which makes it
sensitive to initialization. We set the initial condition
$\hat{y}^{0} = y_t$, i.e. we initialize y at the training label
from which we compute $\hat{h}^{1}$, etc., which is why this is
referred to as mean-field contrastive divergence. We
also tried to use standard mean-field where $\hat{y}^{0} = 0$,
but found the results to be much worse.
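The mean-field iteration can be sketched as below. This is a hedged illustration rather than the authors' implementation: it uses the update forms implied by the energy in (6) with a = 0, runs a fixed number of iterations instead of testing for convergence, and the name `mean_field` is hypothetical.

```python
import math

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

def mean_field(x_t, U, W, c, d, y_init, n_iters=10):
    """Alternate the two mean-field updates from y_init with x_t fixed.

    Returns (y_hat, h_hat), the continuous-valued estimates after n_iters sweeps.
    """
    n_hidden, n_tags = len(c), len(d)
    y_hat = [float(v) for v in y_init]
    # W x_t never changes, so it is computed once outside the loop.
    Wx = [sum(W[i][k] * x_t[k] for k in range(len(x_t))) for i in range(n_hidden)]
    h_hat = [0.0] * n_hidden
    for _ in range(n_iters):
        h_hat = [sigm(c[i] + sum(U[i][j] * y_hat[j] for j in range(n_tags)) + Wx[i])
                 for i in range(n_hidden)]
        y_hat = [sigm(d[j] + sum(U[i][j] * h_hat[i] for i in range(n_hidden)))
                 for j in range(n_tags)]
    return y_hat, h_hat
```

Passing the training label as `y_init` corresponds to the mean-field CD variant used for training, while a label-free `y_init` is needed at test time.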

Loopy belief propagation [15] (LBP) is another al- 
gorithm for approximating intractable marginals in 
a graphical model. It relies on a message passing 
procedure between the variables of the graph. While
not guaranteed to converge, it frequently does in prac-
tice, and gives estimates of the true marginals that
are often more accurate than the iterative mean-field
procedure [16]. In this setting, we used LBP to esti-
mate the marginals $p(y_j = 1 \mid x_t)$, $p(h_k = 1 \mid x_t)$ and
$p(y_j = 1, h_k = 1 \mid x_t)$ for a given $x_t$ under the dis-
criminative RBM, and used those marginals to com-
pute the $\mathbb{E}_{y, h \mid x_t}$ term in equation (7). The quantity
$p(y_j = 1 \mid x_t)$ can also be estimated at test time to pre-
dict the labels. One method that has been shown to be
useful in aiding convergence is message damped belief 
propagation [17]. In this case the updates computed 
by belief propagation are mixed with the previous 
updates for the same variables in order to smooth 
them, the damping factor being a parameter of the 
algorithm. 
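The damping step itself is a convex combination of the old and new messages. The sketch below shows one common convention, in which the damping factor weights the previous message; the text does not state which convention was used, so this choice and the function name `damp` are assumptions.

```python
def damp(previous, proposed, damping):
    """Mix newly computed messages with the previous ones.

    damping = 0 gives the plain belief-propagation update;
    values closer to 1 keep more of the previous message.
    """
    if not 0.0 <= damping < 1.0:
        raise ValueError("damping must be in [0, 1)")
    return [damping * old + (1.0 - damping) * new
            for old, new in zip(previous, proposed)]
```

In a message-passing loop, each freshly computed message vector would be passed through `damp` before being stored for the next iteration.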

Another method for tuning the parameters aims to 
optimize not the likelihood of the data, but the pseudo- 
likelihood [18]. The pseudo-likelihood circumvents the 
intractability of computing the partition function in 
(3) by considering only configurations of the visible 
units that are within a Hamming distance of 1 from 
the training observation. 

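$$\log \mathrm{PL}(y \mid x) = \sum_j \log p(y_j \mid y_{\setminus j}, x) = \sum_j \left[ \log p(y \mid x) - \log\left( p(y \mid x) + p(\tilde{y}_j \mid x) \right) \right] \tag{10}$$

where $y_{\setminus j}$ is the labels vector y without the jth vari-
able and $\tilde{y}_j$ is the labels vector y with the jth bit
flipped (the subscript t is removed here for clarity).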
The pseudo-likelihood can be optimized using gradient 
descent. 
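Because the partition function cancels in each term, the log pseudo-likelihood only needs unnormalized probabilities of y and of its C single-bit flips. A minimal sketch, assuming binary hidden units and the energy in (6) with u = x and a = 0 (function names are hypothetical):

```python
import math

def neg_free_energy(y, x, U, W, c, d):
    """Log of the unnormalized p(y | x): d^T y + sum_i log(1 + exp(c_i + U_i y + W_i x))."""
    total = sum(dj * yj for dj, yj in zip(d, y))
    for i in range(len(c)):
        act = (c[i]
               + sum(U[i][j] * y[j] for j in range(len(y)))
               + sum(W[i][k] * x[k] for k in range(len(x))))
        total += math.log1p(math.exp(act))
    return total

def log_pseudo_likelihood(y, x, U, W, c, d):
    """Sum over tags j of log p(y_j | y without j, x)."""
    base = neg_free_energy(y, x, U, W, c, d)
    total = 0.0
    for j in range(len(y)):
        flipped = list(y)
        flipped[j] = 1 - flipped[j]
        other = neg_free_energy(flipped, x, U, W, c, d)
        # log-sum-exp of the two unnormalized log probabilities, computed stably
        m = max(base, other)
        total += base - (m + math.log(math.exp(base - m) + math.exp(other - m)))
    return total
```

The cost is C + 1 free-energy evaluations per example, in contrast to the 2^C terms of the exact conditional likelihood.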

Because of lack of space, we give pseudocode for
all the aforementioned algorithms in the appendix.
Additionally, the Python code used for training these
models is available on our website^1. Note that while
all of these methods can be used for training, not
all of them can be used at test time to estimate
$p(y_j = 1 \mid x_t)$. Specifically, the pseudo-likelihood re-
quires the knowledge of $y_{\setminus j}$, which is unavailable at
test time. Similarly, CD must be initialized from the
true values of $y_t$ and $x_t$. It is possible to use a Gibbs
sampling method similar to CD starting from an arbi-
trary initialization of $y$, but this is costly because the
Markov chain may need to be run for many iterations
before it mixes well. We found that mean-field CD
could be successfully initialized with $\hat{y}^{0} = 0$ at test
time.

^1 http://www.iro.umontreal.ca/~bengioy/code/drbm_tags

4 Experiments 

We performed a number of experiments to compare 
different hyper-parameter settings, to compare differ- 
ent classifiers, and to compare different tag smoothing 
techniques. These experiments were based on three 
different datasets: data from Amazon.com's Mechani-
cal Turk service^2, data from the MajorMiner music
labeling game^3, and data from Last.fm's users^4. We
compare the discriminative RBM to standard (gen- 
erative) RBMs, multi-layer perceptrons, logistic re- 
gression, and support vector machines. All of these 
algorithms were evaluated in terms of retrieval perfor- 
mance using the area under the ROC curve. 

4.1 Datasets 

Three datasets were used in these experiments. All 
of these datasets were in the form of (user, item, tag) 
triples, where the items were either 10-second clips of 
tracks or whole tracks. These data were condensed 
into (item, tag, count) triples by summing across 
users. 
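The condensing step can be sketched with a counter keyed on (item, tag). This is an illustration only; the function name `condense` is hypothetical, and it assumes that a repeated identical (user, item, tag) triple should count once.

```python
from collections import Counter

def condense(triples):
    """Collapse (user, item, tag) triples into {(item, tag): count}, summing over users."""
    unique = set(triples)  # a user applying the same tag to the same item counts once
    return Counter((item, tag) for _user, item, tag in unique)

counts = condense([("u1", "c1", "rock"), ("u2", "c1", "rock"),
                   ("u2", "c1", "rock"), ("u1", "c2", "jazz")])
```

Pairs that never occur are simply absent, and a `Counter` reports them as zero.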

The first dataset was collected from Amazon.com's
Mechanical Turk service and is described in [14]. Users 
were asked to describe 10-second clips of songs in 
terms of 5 broad categories including genre, emotion, 
instruments, and overall production. The music used 
in the experiment consisted of 185 songs selected 
randomly from the music blogs indexed by the Hype 
Machine^5. From each track, five 10-second clips were
extracted from proportionally equally spaced points, 
for a total of 925 clips. Each clip was seen by a total 
of 3 users, generating approximately 15,500 (user, clip, 
tag) triples from 210 unique users. We used the most 
popular 77 tags for this dataset. 

The second dataset was collected from the Ma-
jorMiner music labeling game and is described in
[7]. Players were asked to describe 10-second clips
of songs and were rewarded for agreeing with other
players and for being original. This dataset includes
approximately 80,000 (user, clip, tag) triples with
2600 unique clips, 650 unique users, and 1000 unique
tags. We used the most popular 77 tags for this
dataset.

^2 http://mturk.com
^3 http://majorminer.org
^4 http://last.fm
^5 http://hypem.com

The final dataset was collected from Last.fm's web- 
site and is described in [19]. The entire dataset con- 
sists of approximately 7 million (user, track, tag) 
triples from 84,000 unique users, 1 million unique 
tracks, and 280,000 unique tags. While only the tex- 
tual information was collected from Last.fm, we were 
able to match it to 47,000 tracks in our own music col- 
lection. While this may seem like a small fraction of 
the total number of tracks, the tracks that were found 
included 1.5 million of the (user, track, tag) triples, 
implying that the tracks we were able to match were 
tagged more often than average. Following similar 
reasoning, many of these users, tracks, and tags oc- 
curred infrequently, with 1 million (user, track, tag) 
triples in which all three items occurred in at least 25 
triples. Because these tags were applied at the track 
level and not at the clip level, we selected one clip 
from the center of each track and assumed that they 
should all be described with the track tags. This is 
the simplest solution to this problem, although using 
some form of multiple-instance learning might find a 
better solution [20]. We used the most popular 100 
tags for this dataset. 

Converting (item, tag, count) triples to binary ma- 
trices for training and evaluation purposes required 
some care. In the MajorMiner and Last.fm data, the 
counts were high enough that we could require the 
verification of an (item, tag) pair by at least two peo- 
ple, meaning that the count had to be at least 2 to 
be considered as a positive example. The Mechanical 
Turk dataset did not have high enough counts to allow 
this, so we had to count every (item, tag) pair. In the 
MajorMiner and Last.fm datasets, (item, tag) pairs 
with only a single count were not used as negative 
examples because we assumed that they had higher 
potential relevance than (item, tag) pairs that never 
occurred, which served as stronger negative examples. 
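The labeling rule can be sketched as a three-way assignment: positive, negative, or unused. The function name `binarize` and the use of `None` for unused pairs are assumptions made for this illustration.

```python
def binarize(counts, items, tags, min_count=2):
    """Return labels[item][tag] in {1, 0, None}.

    1: the pair was verified by at least min_count users (positive example);
    0: the pair never occurred (negative example);
    None: the pair occurred fewer than min_count times and is not used as a negative.
    """
    labels = {}
    for item in items:
        row = {}
        for tag in tags:
            n = counts.get((item, tag), 0)
            row[tag] = 1 if n >= min_count else (0 if n == 0 else None)
        labels[item] = row
    return labels

labels = binarize({("c1", "rock"): 3, ("c1", "jazz"): 1}, ["c1", "c2"], ["rock", "jazz"])
```

Setting `min_count=1` reproduces the Mechanical Turk case, in which every (item, tag) pair is counted as positive.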



Features The timbral and rhythmic features of [7] 
were used to characterize the audio of 10-second song 
clips. The timbral features were the mean and raster- 
ized full covariance of the clip's mel frequency cepstral 
coefficients. They capture information about instru- 
mentation and overall production qualities. The rhyth- 
mic features are based on the modulation spectra in 
four large frequency bands. In fact, they are closely re- 
lated to the autocorrelation in those frequency bands. 
They capture information about the rhythm of the 
various parts of the drum kit (if present), i.e. bass 
drum, tom tom, snare, hi-hat. They also discriminate 
between music that has a strong rhythmic component, 
e.g. dance music, and music that does not, e.g. folk 
rock. Each dimension of both sets of features was 
normalized across the database to have zero mean
and unit variance, and then each feature vector was
normalized to be unit norm to reduce the effect of out- 
liers. The timbral features were 189-dimensional and 
the rhythmic features were 200-dimensional, making 
the combined feature vector 389-dimensional. 
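The two-stage normalization can be sketched as follows. The text does not say which norm or which variance estimator was used, so the L2 norm and the population standard deviation here are assumptions, as is the function name.

```python
import math

def standardize_then_unit_norm(features):
    """Z-score each dimension across the dataset, then scale each vector to unit L2 norm."""
    n, dim = len(features), len(features[0])
    means = [sum(row[k] for row in features) / n for k in range(dim)]
    # "or 1.0" leaves constant dimensions unscaled instead of dividing by zero
    stds = [math.sqrt(sum((row[k] - means[k]) ** 2 for row in features) / n) or 1.0
            for k in range(dim)]
    out = []
    for row in features:
        z = [(row[k] - means[k]) / stds[k] for k in range(dim)]
        norm = math.sqrt(sum(v * v for v in z)) or 1.0
        out.append([v / norm for v in z])
    return out

out = standardize_then_unit_norm([[1.0, 2.0], [3.0, 6.0]])
```

The second stage bounds the length of every feature vector, which is what limits the influence of outlying clips.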

4.2 Classifiers 

We compared a number of classifiers including two 
variants of restricted Boltzmann machines, and three 
other standard classifiers. The RBMs we compared 
were the discriminative RBM, described in Section 3 
and a standard generative RBM. Both RBMs use 
Gaussian input units [21] in order to deal with the 
continuous-valued features for x. The other classifiers
include a multi-layer perceptron, logistic regression,
and support vector machines. For all datasets we
select the hyper-parameters of the model using 5-
fold cross-validation. In order to increase the accuracy of
our measure, for each fold we computed the score as an
average across 4 sub-folds. Each run used a different
fold (from the remaining 4 folds) as the validation set
and the other 3 as the training set.
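One reading of this protocol is a nested rotation over folds, sketched below. The generator name `nested_cv_splits` is hypothetical, and the interpretation (20 runs in total, four per held-out test fold) is an assumption drawn from the description above.

```python
def nested_cv_splits(n_folds=5):
    """Yield (test_fold, validation_fold, training_folds) for every run.

    Each fold is held out for testing once; the remaining folds take turns
    as the validation set while the others are used for training.
    """
    for test in range(n_folds):
        rest = [f for f in range(n_folds) if f != test]
        for valid in rest:
            train = [f for f in rest if f != valid]
            yield test, valid, train

splits = list(nested_cv_splits())
```

The score for a test fold would then be the average over its four (validation, training) runs.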

The discriminative RBM uses the gradient updates
shown in (7), while the generative RBM uses a differ-
ent update in which the second expectation is $\mathbb{E}_{y, x, h}$
instead of $\mathbb{E}_{y, h \mid x_t}$. The generative RBM attempts to
maximize $\log p(y, x)$, while the discriminative RBM
attempts to maximize $\log p(y \mid x)$. It is also possible
to use a mixture of these two objective functions and
maximize $\alpha \log p(y, x) + \log p(y \mid x)$, referred to as a
hybrid generative/discriminative RBM [9]. In our
experiments, however, the hybrid RBM did not im-
prove on the discriminative RBM (DRBM), so we will
not discuss it further.
For each model and dataset pair we optimized the
hyper-parameters using the cross validation described
above, selecting the hyper-parameters with the best
performance on the validation set averaging across
folds and tags. Different hyper-parameters performed
best in each case, which is to be expected given the
differences in the models and in the data. For exam-
ple, one would expect the generative RBM to require
more hidden units than the discriminative RBM be-
cause it models the joint probability. Also on a large
dataset, one would expect to be able to use more hid-
den units without overfitting. The hyper-parameters
that performed best on the validation set can be seen
in Table 1.

Table 1: Parameter settings found to perform best on
validation sets and used in experiments for discrimi-
native and generative RBMs, multi-layer perceptrons,
and logistic regression. LR stands for learning rate.

                      Number of hidden units
Model       LR       MTurk    MajMin    Last.fm
Disc. RBM   0.01     50       100       200
Gen. RBM    0.01     200      300       300
MLP         0.001    250      250       250
Log. reg.   2.0

The multi-layer perceptron (MLP) is quite similar 
in structure to the discriminative RBM in that it has 
nodes representing the features and the classes and 
hidden nodes that capture interactions between them. 
The main difference is that in estimating p(y | x) there 
is no modeling of the interactions between the ele- 
ments of y given x. In the discriminative RBM, at test 
time the unknown y and h interact with one another 
through one of the methods described in Section 3.1 
until they reach a mutually agreeable equilibrium. In 
the case of the MLP, however, at test time h is com- 
puted deterministically from x and y is computed
deterministically from h. The stochastic hidden units
in the discriminative RBM at test time allow it to 
better capture interactions between the variables in y 
(i.e. the tags). 

An even simpler classifier than the MLP is logistic 
regression, which has no hidden layer and predicts 
each class directly from the input features. We simi- 
larly optimize this using gradient descent, where the 
cost function is the cross-entropy between the target 
labels and the predictions, like for the MLP. 

The final classifier we compared is the support vec-
tor machine (SVM). Specifically, we used a linear ker-
nel and a $\nu$-SVM [22] to automatically select the C
parameter. We trained a different SVM for each tag
as an independent two-way decision (e.g. rock vs not 
rock). While the above methods based on stochastic 
gradient descent can be trained on all examples, SVMs 
are more sensitive to the relative number of positive 
and negative examples, so we had to more carefully 
select the training examples to use for each tag. To 
do this, we selected as positive examples those clips to 
which users applied a given tag most frequently and as 
negative examples those clips to which users applied 
a given tag least frequently (generally 0 times). The
actual training labels used, however, were still the 
standard ±1 targets. We ensured that there were the 
same number of positive and negative examples, up 
to 200 of each. 
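The balanced selection of SVM training examples can be sketched as below. The function name and the rule that a positive example needs a count above zero are assumptions for this illustration; in practice the positive threshold would follow the dataset-specific rules of Section 4.1.

```python
def select_svm_examples(tag_counts, max_per_class=200):
    """Pick balanced positive and negative clips for one tag.

    tag_counts maps clip -> number of users who applied the tag.
    Returns (positives, negatives) of equal size, at most max_per_class each.
    """
    ranked = sorted(tag_counts, key=lambda clip: tag_counts[clip], reverse=True)
    positives = [clip for clip in ranked if tag_counts[clip] > 0][:max_per_class]
    chosen = set(positives)
    # walk the ranking from the least frequently tagged clip upwards
    negatives = [clip for clip in reversed(ranked) if clip not in chosen][:max_per_class]
    n = min(len(positives), len(negatives))
    return positives[:n], negatives[:n]

pos, neg = select_svm_examples({"a": 5, "b": 3, "c": 0, "d": 0, "e": 1}, max_per_class=2)
```

Truncating both lists to the same length keeps the two classes balanced even when a tag has fewer positives than the cap.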

Figure 2: Average area under the ROC curve with
standard errors on the MajorMiner dataset for the
discriminative RBM trained using loopy belief propa-
gation with different damping factors.

Metrics The performance of all of these algorithms
on all of these datasets is evaluated in terms of re-
trieval performance using the area under the ROC
curve (AUC) [23]. This metric scores the ability of an
algorithm to rank relevant examples in a collection
above irrelevant examples. A random ranking will
achieve an AUC of approximately 0.5, while a perfect
ranking will achieve an AUC of 1.0. In certain ex-
periments CRBMs were used to smooth the training
data, but the testing data was always the unsmoothed,
user-supplied tags. We measure the AUC for each tag
separately. We use the average across tags and folds
as an overall measure of performance and consider the
standard error across folds for Figure 3. For a more
detailed comparison we use a two-sided paired t-test
across folds, per tag, between two models. We count
the number of tags for which each model performs
significantly better than the other at the 95% confidence level.
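The AUC for a single tag can be computed directly from its ranking interpretation: the probability that a randomly chosen relevant clip is scored above a randomly chosen irrelevant one. The sketch below is a simple quadratic-time illustration, not the evaluation code used in the experiments; counting ties as one half is a common convention and an assumption here.

```python
def auc(scores, labels):
    """Area under the ROC curve for one tag.

    scores: predicted relevance per clip; labels: 1 for relevant, 0 for irrelevant.
    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Averaging this quantity over tags and folds gives the overall score reported in the tables.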

Implementation details / Running time In or- 
der to find the parameters that worked best for the 
DRBM, we used a grid search. To avoid a prohibitive 
number of combinations, we settled on a learning 
rate and number of hidden units before exploring 
gradient approximations, loopy belief propagation
damping factors, and numbers of iterations for CD, 
MF-CD or LBP. We also performed a much wider 
parameter search on the smaller datasets, MTurk 
and MajorMiner, keeping the same parameters for 
Last.fm, but varying the number of hidden units. We 
found that the DRBM is insensitive to the number 
of iteration steps while the computational cost in- 
creases considerably. Training time varies according 
to many details, but on average, to train a DRBM on 
MajorMiner took around 48 CPU-hours. 

4.3 Experiments 

Experiment 1 The first experiment measured the 
effectiveness of different settings of the smoothing 
hyper-parameter in loopy belief propagation, meant 
to aid the convergence of the algorithm. Figure 2 
shows the mean area under the ROC curve (AUC)
on the MajorMiner dataset of discriminative RBMs
trained using loopy belief propagation (LBP) with dif-
ferent damping factors. We use 10 training iterations.
The plots show that the damping factor does not
change the accuracy of the model appreciably. Very
similar results were obtained on the MTurk dataset
(not shown), while for the Last.fm dataset we only use
a damping factor of 0.9, which performed best on MTurk.

Table 2: Average AROC across tags as a percent-
age for each algorithm on each dataset with a speci-
fied number of tags (Tgs) and with and without tag
smoothing (Sm).

Dataset   Tgs   Sm   DRBM   MLP    RBM    LOG    SVM
MTurk     27    -    68.8   65.4   65.4   65.7   62.3
MTurk     27    +    68.4   65.6   64.6   66.7   66.0
MTurk     77    -    65.9   65.8   62.9   63.4   59.2
MTurk     77    +    65.9   66.1   62.4   64.6   64.0
MajMin    77    -    76.1   75.3   70.0   70.7   64.5
MajMin    77    +    74.8   74.8   68.2   73.3   71.5
Last.fm   70    -    72.2   72.0   65.9   70.3   64.6
Last.fm   100   -    72.4   72.4   66.1   70.2   64.5

Experiment 2 The second experiment compared 
discriminative RBMs trained and tested with differ- 
ent combinations of approximations to the intractable 
expectation in (7). We use different approximations 
on train and test to fully explore the space of possibil- 
ities. The left plot in Figure 3 shows the mean AUC 
of these discriminative RBMs on the MTurk dataset, 
while the right plot shows the same results for Ma- 
jorMiner. The four training approximations, in order 
of performance on MTurk, were contrastive diver-
gence (CD), pseudo-likelihood (PL), loopy belief prop-
agation (LBP) and mean-field contrastive divergence
(MF). On MajorMiner the same order was preserved, 
except that loopy belief propagation outperformed 
pseudo-likelihood. The testing approximations, in 
order of performance were LBP and mean-field (MF). 
The training approximation had a larger impact on 
the final result than the testing approximation. For 
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Performance on MTurk (77) 
vs. training/testing approximations 



Testing approximation 
■ LBP MF 



li ii li i i 



PL LBP MF 

Training approximation 



Performance on MajMin (77) 
vs. training/testing approximations 

Testing approximation 

~| LBP MF 




PL LBP MF 

Training approximation 



Figure 3: Results on Mechanical Turk and MajorMiner comparing the performance of different approximations 
for the discriminative RBM during training and testing: contrastive divergence (CD), pseudo- likelihood (PL), 
mean field contrastive divergence (MF) and loopy belief propagation (LBP). The approximations used during 
training are represented on the x-axis, while the approximations used during testing are represented through 
the gray value of the bar. 



Last.fm we only used CD during training and LBP 
at test time. We also found that the model is quite 
robust to the number of training or testing iterations 
for CD, MF or LBP. 
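
As an illustration of the evaluation measure used in this experiment, the following minimal sketch computes the area under the ROC curve for one tag from its rank statistic and averages it over tags. The function names `auc` and `mean_auc`, and the layout of the score and label matrices (clips in rows, tags in columns), are assumptions made for this example rather than details taken from the experiments.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve for one tag, via the rank-sum statistic.

    Tied scores receive their average rank, so a constant scorer gets 0.5.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined without both classes
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(scores.size)
    i = 0
    while i < scores.size:
        j = i
        while j + 1 < scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average 1-based rank
        i = j + 1
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def mean_auc(score_matrix, label_matrix):
    """Average the per-tag AUC over tags (columns), skipping undefined tags."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    label_matrix = np.asarray(label_matrix)
    per_tag = [auc(score_matrix[:, t], label_matrix[:, t])
               for t in range(score_matrix.shape[1])]
    return float(np.nanmean(per_tag))
```

Here `scores` would be the estimated tag marginals produced at test time (for instance by LBP or mean-field inference) and `labels` the binary ground-truth tags.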

Experiment 3 The third experiment compares the 
different classifiers on the three datasets with and 
without tag smoothing. We have also added a slight 
variation of the MTurk and Last.fm datasets restricted 
to a subset of the most popular tags (27 for MTurk 
and 70 for Last.fm). Using a two-sided paired t-test 
per tag, we compare all models to a discriminative 
restricted Boltzmann machine trained on unsmoothed 
data. The same test is done against all of the com- 
parison models: multi-layer perceptron (MLP), logis- 
tic regression (LOG), generative RBM (RBM), and 
support vector machines (SVM). Figure 4 shows, for
each comparison, the number of tags on which one
algorithm significantly outperforms the other. The
DRBM outperforms every other algorithm on many more
tags than it is outperformed on. The MLP is evenly
matched with it on the full Last.fm (100) dataset, but
on the other four datasets the DRBM is significantly
better on many more tags than it is worse. The SVM
and logistic regression
were previously the best performing algorithms on 
these datasets. 
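
The per-tag significance counts described above can be sketched as follows. This is only an illustration under stated assumptions: the pairing is taken to be over cross-validation folds, the function name `count_significant_wins` is hypothetical, and the critical value `t_crit` of the two-sided test must be supplied by the caller (for example, about 2.776 for 5 folds at the 5% level).

```python
import numpy as np

def count_significant_wins(auc_a, auc_b, t_crit):
    """Count tags on which model A is significantly better or worse than B.

    auc_a, auc_b: arrays of shape (n_tags, n_folds) holding per-tag AUCs,
    paired by fold (an assumption of this sketch).
    t_crit: two-sided critical value of Student's t for n_folds - 1
    degrees of freedom.
    Returns (tags where A wins, tags where B wins).
    """
    diff = np.asarray(auc_a, dtype=float) - np.asarray(auc_b, dtype=float)
    n_folds = diff.shape[1]
    mean = diff.mean(axis=1)
    sem = diff.std(axis=1, ddof=1) / np.sqrt(n_folds)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Tags with zero variance in the differences are treated as untestable.
        t_stat = np.where(sem > 0, mean / sem, 0.0)
    return int(np.sum(t_stat > t_crit)), int(np.sum(t_stat < -t_crit))
```

Tags whose statistic falls between the two critical values are counted in neither total, matching the convention of leaving non-significant tags out of the bar plots.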



Figure 5 shows the same analysis comparing each 
classifier trained on the raw, user-supplied tags to the 
same classifier trained on the tags smoothed by the 
proposed tag smoothing conditional RBM. Different 
subsets of the auxiliary inputs were compared and 
the smoothing that gave the best performance on 
the validation folds was selected. Because of the 
size of the Last.fm dataset, only the unsmoothed 
tags were tested. A number of interesting trends 
are visible in Figure 5. First, the SVM and logistic 
regression models are helped by the tag smoothing. 
This makes sense because they treat each tag as a 
separate classification task and cannot by themselves 
take advantage of the relationships between tags. The 
MLP was sometimes helped by tag smoothing, but 
generally was not. The fact that the RBMs were 
not helped by the tag smoothing suggests that they 
are able to capture by themselves the relationships 
between tags and do not need the assistance of the 
tag smoothing. 

5 Conclusion 

This paper has described two applications of con- 
ditional restricted Boltzmann machines to the task 

[Figure 4 plots. Panels: DRBM vs. MLP, DRBM vs. LOG, DRBM vs. RBM, DRBM vs. SVM. x-axis (Datasets): MTurk (77), MTurk (27), Last.fm (100), Last.fm (70), MajMin (77).]

Figure 4: Comparison of discriminative restricted Boltzmann machine autotagging retrieval performance 
to multi-layered perceptron (MLP), logistic regression (LOG), (generative) restricted Boltzmann machine 
(RBM), and support vector machine (SVM). Each bar shows performance on a different dataset and its 
height is the number of tags on which a two-sided paired t-test showed one algorithm to be significantly 
better than the other in terms of area under the ROC curve. Tags that were not significantly different are 
not included in this plot. 



of autotagging music. The discriminative RBM was 
able to achieve a higher average area under the ROC 
curve than the previously best known system for this 
problem, the support vector machine, as well as the 
multi-layer perceptron and logistic regression. In order
to be applied to this problem, the discriminative
RBM was generalized to the multi-label setting and
four different learning algorithms for it were
analyzed in depth. The best results were
obtained for a DRBM using contrastive divergence 
training and loopy belief propagation at test time. 
The performance of the SVM was improved signifi- 
cantly, although not to the level of the DRBM, by the 
purely textual tag smoothing conditional RBM. Both 
of these results demonstrate the power of modeling 
the relationships between tags in autotagging systems. 
A Appendix: Pseudocode

A discriminative RBM is based on the following energy function: 



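$$E(\mathbf{y},\mathbf{h},\mathbf{x}) = -\mathbf{h}^\top W \mathbf{x} - \mathbf{c}^\top \mathbf{h} - \mathbf{d}^\top \mathbf{y} - \mathbf{h}^\top U \mathbf{y} \tag{11}$$

where x is conditioned on. From this energy function, we
can define a probability distribution over y and h as follows:
$p(\mathbf{y},\mathbf{h} \mid \mathbf{x}) \propto e^{-E(\mathbf{y},\mathbf{h},\mathbf{x})}$.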
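
To make the definition concrete, the sketch below evaluates the energy and the exact conditional p(y|x) by brute-force enumeration on a tiny model. It assumes the energy has the form E(y,h,x) = -h^T W x - c^T h - d^T y - h^T U y, which is consistent with the conditionals used in Algorithms 1 and 2; the function names are hypothetical, and the enumeration is only feasible for a handful of tags, which is exactly why the approximations in the following sections are needed.

```python
import itertools
import numpy as np

def energy(y, h, x, W, U, c, d):
    """Energy E(y,h,x) under the assumed form stated in the lead-in."""
    return -(h @ W @ x) - c @ h - d @ y - h @ U @ y

def free_energy(y, x, W, U, c, d):
    """-log sum_h exp(-E(y,h,x)), with the binary h summed out analytically."""
    act = c + W @ x + U @ y
    return -(d @ y) - np.sum(np.logaddexp(0.0, act))

def exact_conditional(x, W, U, c, d):
    """Enumerate all 2^|y| binary tag vectors and return them with p(y|x)."""
    n_y = d.shape[0]
    ys = np.array(list(itertools.product([0.0, 1.0], repeat=n_y)))
    neg_f = np.array([-free_energy(y, x, W, U, c, d) for y in ys])
    log_z = np.logaddexp.reduce(neg_f)  # log partition function given x
    return ys, np.exp(neg_f - log_z)
```

The cost of `exact_conditional` grows as 2^|y|, so with 77 or 100 tags the normalization is out of reach and must be approximated.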

In the next sections, we describe the different approaches we 
evaluated for training such an RBM. 

A.1 Contrastive Divergence

The most straightforward approach is perhaps to train the RBM
to maximise the conditional log-likelihood of the associated 
target vector y by gradient descent. To do so, we need to 
estimate the following gradient: 



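$$\frac{\partial}{\partial \theta} \log p(\mathbf{y}_t \mid \mathbf{x}_t) = -\mathbb{E}_{\mathbf{h} \mid \mathbf{y}_t, \mathbf{x}_t}\left[\frac{\partial}{\partial \theta} E(\mathbf{y}_t,\mathbf{h},\mathbf{x}_t)\right] + \mathbb{E}_{\mathbf{y},\mathbf{h} \mid \mathbf{x}_t}\left[\frac{\partial}{\partial \theta} E(\mathbf{y},\mathbf{h},\mathbf{x}_t)\right] \tag{12}$$
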



Since the second expectation, $\mathbb{E}_{\mathbf{y},\mathbf{h} \mid \mathbf{x}_t}$, is intractable, we need to approximate
it somehow. The contrastive divergence algorithm [11] proposes
to replace this expectation by a point estimate at a sample $\mathbf{y}^K$,
obtained by running Gibbs sampling initialized at $\mathbf{y}_t$ for K
iterations. Given the sample $\mathbf{y}^K$ and $\mathbf{x}_t$, the expectation
with respect to h is tractable.

Algorithm 1 describes the associated training update, given 
an example (y,x). In our notation, $a \leftarrow b$ means a is set to
value b and $a \sim p$ means a is sampled from the distribution p.

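Algorithm 1 Discriminative RBM training update
using Contrastive Divergence.

Input: training pair (y,x), number of iterations
K and learning rate $\lambda$

# Positive phase
$\mathbf{y}^0 \leftarrow \mathbf{y}$, $\hat{\mathbf{h}}^0 \leftarrow \mathrm{sigm}(\mathbf{c} + W\mathbf{x} + U\mathbf{y}^0)$

# Negative phase (we are doing CD-K here)
for K iterations do
    $\mathbf{h}^k \sim p(\mathbf{h} \mid \mathbf{y}^k, \mathbf{x})$
    $\mathbf{y}^{k+1} \sim p(\mathbf{y} \mid \mathbf{h}^k)$
    $\hat{\mathbf{h}}^{k+1} \leftarrow \mathrm{sigm}(\mathbf{c} + W\mathbf{x} + U\mathbf{y}^{k+1})$
end for

# Update
for $\theta \in \Theta$ do
    $\theta \leftarrow \theta - \lambda \left( \frac{\partial}{\partial \theta} E(\mathbf{y}^0,\mathbf{x},\hat{\mathbf{h}}^0) - \frac{\partial}{\partial \theta} E(\mathbf{y}^K,\mathbf{x},\hat{\mathbf{h}}^K) \right)$
end for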
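
The contrastive divergence update of Algorithm 1 might be written in NumPy roughly as follows. This is a sketch rather than the implementation used in the experiments: it assumes the energy E(y,h,x) = -h^T W x - c^T h - d^T y - h^T U y, so that the energy gradients are simple outer products, and the function name `cd_update` and its argument layout (W of shape |h| x |x|, U of shape |h| x |y|) are choices made for this example.

```python
import numpy as np

def sigm(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def cd_update(y, x, W, U, c, d, K=1, lr=0.01, rng=None):
    """One CD-K update of the parameters, performed in place."""
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: clamp y to the observed tags.
    y0 = np.asarray(y, dtype=float)
    h0_hat = sigm(c + W @ x + U @ y0)
    # Negative phase: K steps of block Gibbs sampling started at y0.
    yk, hk_hat = y0, h0_hat
    for _ in range(K):
        hk = (rng.random(hk_hat.shape) < hk_hat).astype(float)  # h ~ p(h|y,x)
        p_y = sigm(d + U.T @ hk)
        yk = (rng.random(p_y.shape) < p_y).astype(float)        # y ~ p(y|h)
        hk_hat = sigm(c + W @ x + U @ yk)
    # Update: theta <- theta - lr * (dE(y0, h0_hat) - dE(yK, hK_hat)),
    # where dE/dW = -h x^T, dE/dU = -h y^T, dE/dc = -h, dE/dd = -y.
    W += lr * (np.outer(h0_hat, x) - np.outer(hk_hat, x))
    U += lr * (np.outer(h0_hat, y0) - np.outer(hk_hat, yk))
    c += lr * (h0_hat - hk_hat)
    d += lr * (y0 - yk)
    return W, U, c, d
```

With K = 0 the positive and negative statistics coincide and the parameters are left unchanged, which is a convenient sanity check.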



A.2 Mean-Field Contrastive Divergence

A non-stochastic alternative to contrastive divergence is mean- 
field contrastive divergence [24], where samples are replaced by 
expectations. This procedure is detailed in Algorithm 2.

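Algorithm 2 Discriminative RBM training update
using Mean-Field Contrastive Divergence.

Input: training pair (y,x), number of iterations
K and learning rate $\lambda$

# Positive phase
$\hat{\mathbf{y}}^0 \leftarrow \mathbf{y}$, $\hat{\mathbf{h}}^0 \leftarrow \mathrm{sigm}(\mathbf{c} + W\mathbf{x} + U\hat{\mathbf{y}}^0)$

# Negative phase (we are doing MFCD-K here)
for K iterations do
    $\hat{\mathbf{y}}^{k+1} \leftarrow \mathrm{sigm}(\mathbf{d} + U^\top \hat{\mathbf{h}}^k)$
    $\hat{\mathbf{h}}^{k+1} \leftarrow \mathrm{sigm}(\mathbf{c} + W\mathbf{x} + U\hat{\mathbf{y}}^{k+1})$
end for

# Update
for $\theta \in \Theta$ do
    $\theta \leftarrow \theta - \lambda \left( \frac{\partial}{\partial \theta} E(\hat{\mathbf{y}}^0,\mathbf{x},\hat{\mathbf{h}}^0) - \frac{\partial}{\partial \theta} E(\hat{\mathbf{y}}^K,\mathbf{x},\hat{\mathbf{h}}^K) \right)$
end for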
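
A NumPy sketch of the mean-field variant in Algorithm 2 is given below. It differs from contrastive divergence only in that the Gibbs samples are replaced by their expectations, so the update is deterministic. As before, the energy form E(y,h,x) = -h^T W x - c^T h - d^T y - h^T U y and the name `mfcd_update` are assumptions of this example.

```python
import numpy as np

def sigm(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def mfcd_update(y, x, W, U, c, d, K=1, lr=0.01):
    """One mean-field CD-K update of the parameters, performed in place."""
    # Positive phase: clamp y to the observed tags.
    y0 = np.asarray(y, dtype=float)
    h0_hat = sigm(c + W @ x + U @ y0)
    # Negative phase: propagate expectations instead of samples.
    yk_hat, hk_hat = y0, h0_hat
    for _ in range(K):
        yk_hat = sigm(d + U.T @ hk_hat)
        hk_hat = sigm(c + W @ x + U @ yk_hat)
    # Same update rule as CD, with expectations in the negative statistics.
    W += lr * (np.outer(h0_hat, x) - np.outer(hk_hat, x))
    U += lr * (np.outer(h0_hat, y0) - np.outer(hk_hat, yk_hat))
    c += lr * (h0_hat - hk_hat)
    d += lr * (y0 - yk_hat)
    return W, U, c, d
```

Because no random numbers are drawn, two calls with the same inputs produce the same update, which makes this variant easy to test and debug.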



A.3 Loopy Belief Propagation

Instead of using a sample y^ to approximate the intractable 
expectation, one could try to estimate directly the associated 
marginals required by this expectation. Specifically, those
marginals are $p(y_j = 1 \mid \mathbf{x})$, $p(h_k = 1 \mid \mathbf{x})$ and $p(y_j = 1, h_k = 1 \mid \mathbf{x})$.

Loopy belief propagation [15] is a popular algorithm for approximating
such marginals. Algorithm 3 details this procedure for
the discriminative RBM. The given algorithm computes messages
in log-space and, for computational efficiency, messages
are normalized so that log-messages from zero-valued variables
are 0 (hence only messages from one-valued variables are passed).

A.4 Pseudo-likelihood training

Finally, instead of approximately estimating the gradient of 
the log-likelihood logp(yt | xt), one could instead replace it by 
the pseudo-likelihood objective [18]: 

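$$\log \mathrm{PL}(\mathbf{y} \mid \mathbf{x}) = \sum_j \log p(y_j \mid \mathbf{y}_{\setminus j}, \mathbf{x}) \tag{13}$$

$$= \sum_j \left[ \log p(\mathbf{y} \mid \mathbf{x}) - \log \left( p(\mathbf{y} \mid \mathbf{x}) + p(\tilde{\mathbf{y}}_j \mid \mathbf{x}) \right) \right]$$

where $\tilde{\mathbf{y}}_j$ denotes y with its j-th entry flipped, and compute its gradient exactly. Algorithm 4 details the
procedure for updating the RBM according to that criterion.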
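Algorithm 3 Loopy Belief Propagation algorithm for inference in discriminative RBM
Input: training pair (y,x), number of iterations K and damping factor $\beta$

$\mathbf{c}^{\mathrm{data}} \leftarrow \mathbf{c} + W\mathbf{x}$

# Update downwards (towards y) and upwards (towards h) messages
for K iterations do
    $m^{\downarrow}_{kj} \leftarrow \beta m^{\downarrow}_{kj} + (1 - \beta) \log\left(1 + (\exp(U_{kj}) - 1)\, \mathrm{sigm}\left(c^{\mathrm{data}}_k + \sum_{j' \neq j} m^{\uparrow}_{j'k}\right)\right), \forall j, k$
    $m^{\uparrow}_{jk} \leftarrow \beta m^{\uparrow}_{jk} + (1 - \beta) \log\left(1 + (\exp(U_{kj}) - 1)\, \mathrm{sigm}\left(d_j + \sum_{k' \neq k} m^{\downarrow}_{k'j}\right)\right), \forall j, k$
end for

# Compute estimated singleton and pair-wise marginals
$p^{\mathrm{LBP}}(y_j = 1 \mid \mathbf{x}) \leftarrow \mathrm{sigm}\left(d_j + \sum_k m^{\downarrow}_{kj}\right), \forall j$
$p^{\mathrm{LBP}}(h_k = 1 \mid \mathbf{x}) \leftarrow \mathrm{sigm}\left(c^{\mathrm{data}}_k + \sum_j m^{\uparrow}_{jk}\right), \forall k$
$\mathrm{num}^{y}_{jk} \leftarrow d_j + \sum_{k' \neq k} m^{\downarrow}_{k'j}$, $\mathrm{num}^{h}_{jk} \leftarrow c^{\mathrm{data}}_k + \sum_{j' \neq j} m^{\uparrow}_{j'k}, \forall j, k$
$\mathrm{num}^{yh}_{jk} \leftarrow U_{kj} + \mathrm{num}^{y}_{jk} + \mathrm{num}^{h}_{jk}, \forall j, k$
$p^{\mathrm{LBP}}(y_j = 1, h_k = 1 \mid \mathbf{x}) = \exp(\mathrm{num}^{yh}_{jk}) / \left(1 + \exp(\mathrm{num}^{y}_{jk}) + \exp(\mathrm{num}^{h}_{jk}) + \exp(\mathrm{num}^{yh}_{jk})\right), \forall j, k$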
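
The damped message passing of Algorithm 3 can be sketched in NumPy as below. Several details are assumptions of this example rather than statements from the paper: messages are initialized to zero, the downward messages are updated before the upward ones within each iteration, both message arrays are stored with shape |h| x |y|, and the function name `lbp_marginals` is hypothetical.

```python
import numpy as np

def sigm(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def lbp_marginals(x, W, U, c, d, K=10, beta=0.9):
    """Approximate p(y_j=1|x), p(h_k=1|x) and p(y_j=1,h_k=1|x) by damped LBP."""
    c_data = c + W @ x
    n_h, n_y = U.shape
    m_down = np.zeros((n_h, n_y))  # m_down[k, j]: log-message from h_k to y_j
    m_up = np.zeros((n_h, n_y))    # m_up[k, j]: log-message from y_j to h_k
    exp_u_m1 = np.exp(U) - 1.0
    for _ in range(K):
        # Input to h_k from its bias and all tags except y_j.
        in_h = c_data[:, None] + m_up.sum(axis=1, keepdims=True) - m_up
        m_down = beta * m_down + (1.0 - beta) * np.log1p(exp_u_m1 * sigm(in_h))
        # Input to y_j from its bias and all hidden units except h_k.
        in_y = d[None, :] + m_down.sum(axis=0, keepdims=True) - m_down
        m_up = beta * m_up + (1.0 - beta) * np.log1p(exp_u_m1 * sigm(in_y))
    # Singleton marginals.
    p_y = sigm(d + m_down.sum(axis=0))
    p_h = sigm(c_data + m_up.sum(axis=1))
    # Pair-wise marginals from the cavity inputs of each (h_k, y_j) edge.
    a_y = d[None, :] + m_down.sum(axis=0, keepdims=True) - m_down
    a_h = c_data[:, None] + m_up.sum(axis=1, keepdims=True) - m_up
    num_yh = U + a_y + a_h
    log_norm = np.logaddexp(np.logaddexp(0.0, a_y), np.logaddexp(a_h, num_yh))
    p_yh = np.exp(num_yh - log_norm)  # p_yh[k, j] = p(y_j=1, h_k=1 | x)
    return p_y, p_h, p_yh
```

On a model with a single hidden unit the graph is a tree, so with `beta=0.0` the returned marginals are exact after two iterations; on the full bipartite graph they are only approximations.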



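Algorithm 4 Pseudo-likelihood training update algorithm in discriminative RBM
Input: training pair (y,x), learning rate $\lambda$

# Forward propagation
$\mathbf{c}^{\mathrm{data}} \leftarrow \mathbf{c} + W\mathbf{x} + U\mathbf{y}$
$\log \mathrm{PL}(\mathbf{y} \mid \mathbf{x}) \leftarrow 0$
for j from 1 to |y| do
    $p(y_j = 1 \mid \mathbf{y}_{\setminus j}, \mathbf{x}) \leftarrow \mathrm{sigm}\left\{ d_j + \sum_k \log\left[1 + \exp(c^{\mathrm{data}}_k - U_{kj} y_j + U_{kj})\right] - \sum_k \log\left[1 + \exp(c^{\mathrm{data}}_k - U_{kj} y_j)\right] \right\}$
    $\log \mathrm{PL}(\mathbf{y} \mid \mathbf{x}) \leftarrow \log \mathrm{PL}(\mathbf{y} \mid \mathbf{x}) - y_j \log p(y_j = 1 \mid \mathbf{y}_{\setminus j}, \mathbf{x}) - (1 - y_j) \log(1 - p(y_j = 1 \mid \mathbf{y}_{\setminus j}, \mathbf{x}))$
end for

# Backward propagation
$\partial U \leftarrow 0$, $\partial W \leftarrow 0$, $\partial \mathbf{c} \leftarrow 0$
for j from 1 to |y| do
    $\partial \mathrm{out}_j \leftarrow p(y_j = 1 \mid \mathbf{y}_{\setminus j}, \mathbf{x}) - y_j$, $\partial d_j \leftarrow \partial \mathrm{out}_j$, $\partial \mathrm{hid} \leftarrow 0$
    for k from 1 to |h| do
        $\partial U_{kj} \leftarrow \partial U_{kj} + \partial \mathrm{out}_j \left( (1 - y_j)\, \mathrm{sigm}(c^{\mathrm{data}}_k - U_{kj} y_j + U_{kj}) + y_j\, \mathrm{sigm}(c^{\mathrm{data}}_k - U_{kj} y_j) \right)$
        $\partial \mathrm{hid}_k \leftarrow \partial \mathrm{out}_j \left( \mathrm{sigm}(c^{\mathrm{data}}_k - U_{kj} y_j + U_{kj}) - \mathrm{sigm}(c^{\mathrm{data}}_k - U_{kj} y_j) \right)$
    end for
    $\partial U \leftarrow \partial U + \partial \mathrm{hid}\, \mathbf{y}^\top$, $\partial W \leftarrow \partial W + \partial \mathrm{hid}\, \mathbf{x}^\top$, $\partial \mathbf{c} \leftarrow \partial \mathbf{c} + \partial \mathrm{hid}$
end for

# Update
$U \leftarrow U - \lambda \partial U$, $W \leftarrow W - \lambda \partial W$, $\mathbf{c} \leftarrow \mathbf{c} - \lambda \partial \mathbf{c}$, $\mathbf{d} \leftarrow \mathbf{d} - \lambda \partial \mathbf{d}$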
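
A vectorized NumPy sketch of the pseudo-likelihood update in Algorithm 4 follows. It replaces the two nested loops by array operations of shape |h| x |y| and accumulates the negative log pseudo-likelihood as the cost being minimized. The function names are hypothetical, and the code assumes binary tags stored as floats.

```python
import numpy as np

def sigm(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def softplus(a):
    """Numerically stable log(1 + exp(a))."""
    return np.logaddexp(0.0, a)

def neg_log_pl(y, x, W, U, c, d):
    """Negative log pseudo-likelihood of the tag vector y given x."""
    c_data = c + W @ x + U @ y
    z0 = c_data[:, None] - U * y[None, :]  # hidden input with y_j set to 0
    z1 = z0 + U                            # hidden input with y_j set to 1
    p = sigm(d + (softplus(z1) - softplus(z0)).sum(axis=0))
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def pl_gradients(y, x, W, U, c, d):
    """Exact gradients of neg_log_pl with respect to U, W, c and d."""
    c_data = c + W @ x + U @ y
    z0 = c_data[:, None] - U * y[None, :]
    z1 = z0 + U
    s0, s1 = sigm(z0), sigm(z1)
    p = sigm(d + (softplus(z1) - softplus(z0)).sum(axis=0))
    d_out = p - y                                    # one entry per tag j
    d_hid = ((s1 - s0) * d_out[None, :]).sum(axis=1)  # summed over tags
    d_U = d_out[None, :] * ((1.0 - y)[None, :] * s1 + y[None, :] * s0)
    d_U = d_U + np.outer(d_hid, y)
    return d_U, np.outer(d_hid, x), d_hid, d_out

def pl_update(y, x, W, U, c, d, lr=0.01):
    """One gradient step on the negative log pseudo-likelihood, in place."""
    d_U, d_W, d_c, d_d = pl_gradients(y, x, W, U, c, d)
    U -= lr * d_U
    W -= lr * d_W
    c -= lr * d_c
    d -= lr * d_d
    return W, U, c, d
```

Since the gradient is exact, it can be checked against finite differences of `neg_log_pl`, which is a useful safeguard when modifying the update.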